
    Cppless: Productive and Performant Serverless Programming in C++

    The rise of serverless computing introduced a new class of scalable, elastic, and highly available parallel workers in the cloud. Many systems and applications benefit from offloading computations and parallel tasks to dynamically allocated resources. However, developers of C++ applications find it difficult to integrate serverless functions due to complex deployment, a lack of compatibility between client and cloud environments, and loosely typed input and output data. To enable single-source and efficient serverless acceleration in C++, we introduce Cppless, an end-to-end framework for implementing serverless functions that handles the creation, deployment, and invocation of functions. Cppless is built on top of LLVM and requires only two compiler extensions to automatically extract C++ function objects and deploy them to the cloud. We demonstrate that offloading parallel computations from a C++ application to serverless workers can provide up to 30x speedup, requiring only minor code modifications and costing less than one cent per computation.
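
    To make the single-source idea concrete, below is a minimal C++ sketch of the offloading pattern the abstract describes. The dispatcher invoke_remote is an illustrative stand-in (modeled locally with std::async), not the actual Cppless API; in Cppless, the compiler extracts the function object and generates typed cloud-invocation stubs.

```cpp
#include <future>
#include <numeric>
#include <vector>

// The unit of offloading: an ordinary, statically typed C++ function.
double partial_sum(std::vector<double> chunk) {
  return std::accumulate(chunk.begin(), chunk.end(), 0.0);
}

// Illustrative stand-in for a serverless dispatcher. A Cppless-style
// toolchain would extract the function, deploy it to the cloud, and generate
// a typed invocation stub; std::async merely keeps this sketch runnable.
template <typename Fn, typename... Args>
auto invoke_remote(Fn&& fn, Args&&... args) {
  return std::async(std::launch::async,
                    std::forward<Fn>(fn), std::forward<Args>(args)...);
}

int main() {
  std::vector<double> data(1 << 20, 1.0);
  auto mid = data.begin() + data.size() / 2;

  // Arguments are passed by value, mirroring the serialization boundary of a
  // real serverless invocation (workers share no memory with the client).
  auto lo = invoke_remote(partial_sum, std::vector<double>(data.begin(), mid));
  auto hi = invoke_remote(partial_sum, std::vector<double>(mid, data.end()));

  double total = lo.get() + hi.get(); // 1048576.0
  (void)total;
}
```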

    RFaaS: RDMA-Enabled FaaS Platform for Serverless High-Performance Computing

    The rigid MPI programming model and batch scheduling dominate high-performance computing. While clouds brought new levels of elasticity into the world of computing, supercomputers still suffer from low resource utilization rates. To enhance supercomputing clusters with the benefits of serverless computing, a modern cloud programming paradigm for pay-as-you-go execution of stateless functions, we present rFaaS, the first RDMA-aware Function-as-a-Service (FaaS) platform. With hot invocations and decentralized function placement, we overcome the major performance limitations of FaaS systems and provide low-latency remote invocations in multi-tenant environments. We evaluate the new serverless system through a series of microbenchmarks and show that remote functions execute with negligible performance overheads. We demonstrate how serverless computing can bring elastic resource management into MPI-based high-performance applications. Overall, our results show that MPI applications can benefit from modern cloud programming paradigms to guarantee high performance at lower resource costs.
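
    The latency mechanism the abstract highlights, hot invocations, can be sketched conceptually as follows. The names below are illustrative, not the rFaaS API: the client holds a lease on a warm executor and a buffer registered once for RDMA, so a call avoids scheduling and cold starts on the critical path. The actual RDMA verbs (e.g., ibv_post_send and completion-queue polling) are replaced by memcpy to keep the sketch self-contained.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Conceptual model of a "hot" invocation on a pre-leased, warm executor.
struct HotExecutor {
  std::vector<std::uint8_t> registered_buf; // stand-in for RDMA-registered memory

  std::uint64_t invoke(std::uint64_t input) {
    std::memcpy(registered_buf.data(), &input, sizeof input);   // "RDMA write" of the input
    std::uint64_t result = input * 2;                           // placeholder for the remote function body
    std::memcpy(registered_buf.data(), &result, sizeof result); // result write-back
    return result;
  }
};

int main() {
  HotExecutor exec{std::vector<std::uint8_t>(64)}; // buffer registered once, reused per call
  return exec.invoke(21) == 42 ? 0 : 1;            // no per-call allocation or setup
}
```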

    Bridging Control-Centric and Data-Centric Optimization

    With the rise of specialized hardware and new programming languages, code optimization has shifted its focus towards promoting data locality. Most production-grade compilers adopt a control-centric mindset - instruction-driven optimization augmented with scalar-based dataflow - whereas other approaches provide domain-specific and general-purpose data-movement minimization, which can miss important control-flow optimizations. As the two representations are not commutable, users must choose one over the other. In this paper, we explore how both control- and data-centric approaches can work in tandem via the Multi-Level Intermediate Representation (MLIR) framework. Through a combination of an MLIR dialect and specialized passes, we recover parametric, symbolic dataflow that can be optimized within the DaCe framework. We combine the two views into a single pipeline, called DCIR, showing that it is strictly more powerful than either view. On several benchmarks and a real-world application in C, we show that our proposed pipeline consistently outperforms MLIR and automatically uncovers new optimization opportunities with no additional effort.
    Comment: CGO'2
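
    As a minimal, paper-independent C++ illustration of the kind of data-movement optimization a data-centric view exposes, consider fusing two traversals of the same array:

```cpp
#include <vector>

// Before: the two loops traverse `a` twice, moving it through memory twice.
void before(std::vector<double>& a, const std::vector<double>& b) {
  for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] * 2.0; // pass 1
  for (std::size_t i = 0; i < a.size(); ++i) a[i] = a[i] + 1.0; // pass 2
}

// A dataflow representation sees that element i of pass 2 depends only on
// element i of pass 1, so the passes can be fused into a single traversal,
// halving data movement. A purely instruction-driven view must recover the
// same fact indirectly from control flow and aliasing analysis.
void after(std::vector<double>& a, const std::vector<double>& b) {
  for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] * 2.0 + 1.0;
}
```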

    User-guided Page Merging for Memory Deduplication in Serverless Systems

    Serverless computing is an emerging cloud paradigm that offers an elastic and scalable allocation of computing resources with pay-as-you-go billing. In the Function-as-a-Service (FaaS) programming model, applications comprise short-lived and stateless serverless functions executed in isolated containers or microVMs, which can quickly scale to thousands of instances and process terabytes of data. This flexibility comes at the cost of duplicated runtimes, libraries, and user data spread across many function instances, and cloud providers do not utilize this redundancy. The memory footprint of serverless functions forces providers to remove idle containers to make space for new ones, which degrades performance through more frequent cold starts and fewer data-caching opportunities. We address this issue by proposing to deduplicate memory pages of serverless workers with identical content, based on the content-based page-sharing concept of Linux Kernel Same-page Merging (KSM). We replace the background memory-scanning process of KSM, as it is too slow to locate sharing candidates in short-lived functions. Instead, we design User-Guided Page Merging (UPM), a built-in Linux kernel module that leverages the madvise system call: we enable users to advise the kernel of memory areas that can be shared with others. We show that UPM reduces memory consumption by up to 55% on 16 concurrent containers executing a typical image-recognition function, more than doubling the number of containers of the same function that can run on a system.
    Comment: Accepted at IEEE BigData 202
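
    The advice channel UPM builds on can be shown with standard Linux APIs. The sketch below uses the existing KSM flag MADV_MERGEABLE to mark a region as a candidate for content-based page sharing; UPM keeps this user-guided madvise interface but replaces KSM's slow background scanner. UPM's own advice value is defined by its kernel module and is not shown here.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
  const std::size_t len = 4096 * 1024; // 4 MiB of page-aligned anonymous memory
  void* region = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (region == MAP_FAILED) { std::perror("mmap"); return 1; }

  // Identical content across concurrent workers is what makes pages shareable.
  std::memset(region, 0x42, len);

  // Advise the kernel that pages in this region may be deduplicated with
  // other processes' pages of identical content (the stock KSM flag).
  if (madvise(region, len, MADV_MERGEABLE) != 0) std::perror("madvise");

  munmap(region, len);
  return 0;
}
```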

    Performance-Detective: Automatic Deduction of Cheap and Accurate Performance Models

    The many configuration options of modern applications make it difficult for users to select a performance-optimal configuration. Performance models help users understand system performance and choose a fast configuration. Existing performance-modeling approaches for applications and configurable systems either require a full-factorial experiment design or a sampling design based on heuristics, resulting in high costs for achieving accurate models. Furthermore, they require repeated execution of experiments to account for measurement noise. We propose Performance-Detective, a novel code analysis tool that deduces insights on the interactions of program parameters. We use these insights to derive the smallest necessary experiment design and to avoid repeating measurements when possible, significantly lowering the cost of performance modeling. We evaluate Performance-Detective in two case studies, reducing the number of measurements from up to 3125 to only 25 and decreasing cost to only 2.9% of the previously needed core hours, while maintaining model accuracy at 91.5%, compared to 93.8% when using all 3125 measurements.
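
    The reported reduction is consistent with a full-factorial design over five parameters with five values each: 5^5 = 3125 runs, versus a design that varies one parameter at a time, 5 * 5 = 25 runs. The sketch below works through that arithmetic; the parameter counts are an assumption inferred from the numbers, and the design Performance-Detective actually derives depends on the detected parameter interactions.

```cpp
#include <cstdio>

// Back-of-the-envelope for the experiment-design savings: a full-factorial
// sweep of k parameters with v values each needs v^k runs; varying one
// parameter at a time while holding the rest fixed needs only k * v.
int main() {
  const int k = 5, v = 5;                       // assumed parameter and value counts
  long full = 1;
  for (int i = 0; i < k; ++i) full *= v;        // v^k = 3125 runs
  const long sparse = static_cast<long>(k) * v; // 5 * 5 = 25 runs
  std::printf("full-factorial: %ld runs, one-at-a-time: %ld runs (%.1f%%)\n",
              full, sparse, 100.0 * sparse / full); // 25 runs = 0.8% of 3125
  return 0;
}
```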

    Automatic Empirical Performance Modeling of Parallel Programs

    Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes or solving larger problems. Often, such performance bugs manifest themselves only when the code is put into production, a point where remediation can be difficult. Manually creating analytical performance models provides insights into optimization opportunities but is extremely costly if done for applications of realistic size. The effort involved limits application developers to attempting it for at most a few selected kernels, running the risk of missing harmful bottlenecks. Furthermore, tuning large applications requires a clever exploration of the design and configuration space. Especially on supercomputers, this space is so large that its exhaustive traversal via performance experiments becomes too expensive, if not impossible. If multiple performance-relevant parameters and their possible interactions have to be considered at the same time, a common requirement in many situations, the task becomes even more complex.

    The initial contribution of this thesis is a method that substantially improves both the coverage and the speed of performance modeling and analysis. By automatically generating an empirical performance model for each part of a parallel program with respect to the variation of a relevant parameter, such as process count or problem size, it becomes possible to easily identify those parts that will reduce performance at larger core counts or when solving a bigger problem. In the next step, we extended the approach with a method capable of modeling any combination of multiple execution parameters simultaneously, provided sufficient performance measurements are available. Multi-parameter modeling has so far been outside the reach of automatic methods due to the exponential growth of the model search space. Specialized heuristics developed as part of this work traverse the search space rapidly and generate insightful performance models that enable a wide range of uses, from performance predictions for balanced machine design to performance tuning.

    Finally, we present a method that employs automated performance modeling to quickly predict application requirements for varying scales and problem sizes. Following this approach, it is possible to determine the future requirements of major scientific applications, derive an optimization strategy, and illustrate system design trade-offs in the light of those requirements. This supports the co-design process by informing hardware acquisition decisions with the actual needs of the software.

    The methods described in this work are implemented in the performance analysis tool Extra-P, which has been released as open source and has been successfully used to gain insight into the performance of numerous scientific applications from a large range of fields. Since its release, Extra-P has had an impact on the HPC community: developers at both universities and research centers have used it to better understand the performance of their research codes, and tutorials on its use have been offered at international conferences such as EuroMPI and Supercomputing, further demonstrating the effectiveness of this approach in making performance modeling available to developers without requiring expert knowledge of the topic. This work simplifies and streamlines the performance modeling process, offering insights into application behavior quickly and automatically, and allowing the developer to focus on transforming these insights into tangible performance improvements.
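
    As a rough illustration of the model family Extra-P searches, the sketch below evaluates candidate models in the performance model normal form described in the Extra-P literature, f(p) = sum_k c_k * p^(i_k) * log2(p)^(j_k), where p is a relevant parameter such as process count. The exponent sets, fitting procedure, and search heuristics are heavily simplified compared to the actual tool.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// One PMNF term: c * p^i * log2(p)^j.
struct Term { double c, i, j; };

// Evaluate a candidate model (a sum of PMNF terms) at parameter value p.
double eval(const std::vector<Term>& model, double p) {
  double f = 0.0;
  for (const Term& t : model)
    f += t.c * std::pow(p, t.i) * std::pow(std::log2(p), t.j);
  return f;
}

int main() {
  // Example (made-up) model: t(p) = 4.2 + 0.08 * p^1.5. A superlinear term
  // like this flags a code region that will dominate at large process counts.
  std::vector<Term> model = {{4.2, 0.0, 0.0}, {0.08, 1.5, 0.0}};
  for (double p : {64.0, 256.0, 1024.0})
    std::printf("p=%5.0f  predicted time=%8.1f\n", p, eval(model, p));
  return 0;
}
```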